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ABSTRACT 

In this paper, we present the EATS (Eeature Analysis for Time Series) library. EATS is a Python 
library which facilitates and standardizes feature extraction for time series data. In particular, we 
focus on one application: feature extraction for astronomical light curve data, although the library is 
generalizable for other uses. We detail the methods and features implemented for light curve analysis, 
and present examples for its usage. 

Subject headings: methods: data analysis - methods: statistical - stars: statistics - stars: variables: 
general - catalogs 


1. INTRODUCTION 

A time series is a sequence of observations, or data 
points, that is arranged based on the times of their oc¬ 
currence. The hourly measurement of wind speeds in 
meteorology, the minute by minute recording of electri¬ 
cal activity along the scalp in electroencephalography, 
and the weekly changes of stock prices in finances are 
just some examples of time series, among many others. 
Some of the following p roperties may be observed in time 
series data ( Ealk||2Q12 ): 


the data is not generated independently 
their dispersion varies in time 
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Fig. 1.— Example light curves of each class in the MACHO 
training set. 


• they are often governed by a trend and/or have 
cyclic components 


The study and analysis of time series can have multiple 
ends: to gain a better understanding of the mechanism 
generating the data, to predict future outcomes and be¬ 
haviors, to classify and characterize events, or more. 

In time domain astronomy, data gathered from tele¬ 
scopes is usually represented in the form of light curves, 
which are time series that show the brightness variation 
of an object through a period of time. Based on the vari¬ 
ability characteristics of the light curves, celestial objects 
can be classified into different groups (quasars, long pe- 
ri od variables, eclip sing binaries, etc., as shown in figure 
|5 ( Nun et ^|2Q14 )) and consequently can be studied in 
aepth independently. 


Classification of data into groups can be performed in 
several ways given light curve data: primarily, existing 
methods found in the literature use machine learning al¬ 
gorithms that group light curves into categories through 
feature extraction from the light curve data. These light 
curve features, the topic of this work, are numerical or 
categorical properties of the light curves which can be 
used to characterize and distinguish the different vari¬ 
ability classes. Eeatures can range from basic statistical 
properties such as the mean or the standard deviation 
to more complex time series characteristics such as the 


autocorrelation function. These features should ideally 
be informative and discriminative, thus allowing for ma¬ 
chine learning or other algorithms to use them to distin¬ 
guish between classes of light curves. 

In this document, we present a library which al¬ 
lows for the fast and efficient calculation of a com¬ 
pilation of many existing light curve features. The 
main goal is to create a collaborative and open tool 
where users can characterize or analyze an astronom¬ 
ical photometric database while also contributing to 
the library by adding new features. This arXiv docu¬ 
ment as well as the authors listed will be updated inso¬ 
far as new features are created. The reader can also 
visit the library website http://isadoranun.github. 
io/tsfeat/FeaturesDocumentation.html for more in¬ 
formation and a better visualization of the tool. It is 
important to highlight that this library is not necessarily 
restricted to the astronomical domain and can also be 
applied to any kind of time series data. 

Our vision is to be capable of analyzing and compar¬ 
ing light curves from any available astronomical cata¬ 
log in a standard and universal way. This would facil¬ 
itate and make more efficient tasks such as modeling, 
classification, data cleaning, outlier detection, and data 
analysis in general. Consequently, when studying light 
curves, astronomers and data analysts using our library 
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would be able to compare and match different features 
in a standardized way. In order to achieve this goal, 
the library should be run and features generated for ev¬ 
ery existent survey (MACHO, EROS, OGLE, Catalina, 
Pan-STARRS, etc.), as well as for future surveys (LSST), 
and the results shared openly, as is this library. 

In the remainder of this document, we provide an 
overview of the features developed so far and explain 
how users can contribute to the library. 

2. LIGHT CURVE FEATURES 

The next section details the features that we have de¬ 
veloped in order to represent light curves. Eor each fea¬ 
ture, we also describe a benchmark test performed in 
order to test the feature’s correctness. 

Each feature was also tested to ensure invarability to 
unequal sampling. Because telescope observations are 
not always taken at uniformly spaced intervals, the light 
curve features should be invariant t o thi s nonuniformity. 
These tests are explained in section [TT| 

2.1. Mean 

Mean magnitude 


m = 



(1) 


2.2. Standard deviation 

The standard deviation, a, of the sample is defined as: 


a = 



( 2 ) 


2.3. Mean varianee ^ 

This is a simple variability index and is defined as the 
ratio of the standard deviation, cr, to the mean magni¬ 
tude, fh. If a light curve has strong variability, ^ of the 
light curve is generally large. 

Eor a uniform distribution from 0 to 1, the mean is 
equal to 0.5 and the variance is equal to 1/12, thus the 
mean-variance should take a value close to 0.577. 


2.4. Median buff er range pereentaqe (M edianBRP) 
(Riehards et al\2011^ 

Eraction of photometric points within amplitude/10 of 
the median magnitude. 


2.5. Range of a eumulative sum Res {Kim et 

Res is the range of a cumulative sum (Ellaway 1978) 
of each light curve and is defined as: 


Res = max S — min S 

^ E 

i=l 


for / = 1, 2,..., A^. 

Res should take a value close to zero for any symmetric 
distribution. 


2.6. Period (Lomh-Seargle) \Kim et "00^201 1 ) 

The Lomb-Scargle (L-S) algorithm! Scargle 1982) is a 
common tool used in time series for period finding and 
frequency analysis. It is usually chosen over the Discrete 
Eourier Transform (DET) because it can handle unevenly 
spaced data points. In this peridiogram time series are 
decomposed into a linear combination of sine waves of the 
form y = acoscjt + ^sinco’t. By doing this fit, the data 
is transformed from the time domain to the frequency 
domain. The L-S peridiogram, is defined as: 




20-2 


- m) COS [U}(tn - t)] 


1 2 


cos2 [uj{tn - r)] 


- m) sin [u}{tn - t)] 


1 2 \ 


E^Li [^{tn - r) 


where uj = 27rT, T being the period and the time offset 
r is defined by: 


icvniflujr) 


En=l Sm{2ojtn) 
Yln=i cos(2wi„) 


On the other hand, periodic light curves can be trans¬ 
formed into a single light curve in which each period is 
mapped onto the same time axis as follows: 



where T is the period, to is an arbitrary starting point 
and the symbol {} represents the non-integer part of the 
fraction. This process produces a folded light curve on 
an x-axis of folded time that ranges from 0 to 1. 

In order to benchmark correctness of this feature, we 
calculate the period of a synthetic periodic time series 
and compare the obtained value with the real one. We 
also calculate the period of a known periodic light curve 
and fold it as shown in figure 

2.7. Period fit {Kim et al\20ll ) 

We define period fit as the false-alarm probability of 
the largest periodogram value obtained with the Lomb- 
Scargle algorithm. Its value should be close to zero for a 
periodic time series. 


2.8. Range of a eumulativ e sum on phase-fo lded light 
eurve, ^cs (Kim et al\2014 ) 

Range of a cumulative sum (sec |2.5[ ) applied to the 
phase-folded light curve (generated using the period es¬ 
timated from the Lomb-Scargle method). 


2.9. Periodie features extraeted from light eurves usi ng 
generalized LomhSeargle (Riehards et al.fMll ) 

Here, we adopt a model where the time series of the 
photometric magnitudes of variable stars is modeled as 
a superposition of sines and cosines: 


^i(^|/i) = CLi sin(27r/it) + bi cos(27r/^t) -|- bi^o 
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Fig. 2.— Synthetic time series before and after folding. 

where a and b are normalization constants for the si- 
nnsoids of frequency /^. and bi^o is the magnitude offset 
(Richards et al.||2011). To find periodic variations in the 
data, we fit the equation above by minimizing the sum 
of squares, which we denote 




= E 


{dk - mi{tk)Y 


( 3 ) 


where ak is the measurement uncertainty in data point 
dk- We allow the mean to fioat, leading to more robust 
period estimates in the case where the periodic phase is 
not uniformly sampled; in these cases, the model light 
curve has a non-zero mean. This can be important when 
searching for periods on the order of the data span Ttot- 
Now, define 


Xo = E 


(4 - fj-f 


( 4 ) 


k 

where /i is the weighted mean 


= 


Sfc 4/cTfc 




( 5 ) 


Then, the generalized Lomb-Scargle periodogram is: 

{N - 1 ) Xo - Xmif) 


PfU) = 


Xo 


(6) 


where Xmif) is X^ minimized with respect to a, b, and 


Following Debosscher et ah (2007), we fit each light 
curve with a linear term plus a harmonic sum of sinu¬ 
soids: 


3 4 

777-(t) = Ct -\- 

i=l j=l 


where each of the three test frequencies fi is allowed to 
have four harmonics at frequencies fij = jfi. The three 
test frequencies fi are found iteratively, by successfully 
finding and removing periodic signal producing a peak 
in Pf{f), where P/(/) is the Lomb-Scargle periodogram 
as defined above. 

Given a peak in P/(/), we whiten the data with respect 
to that frequency by fitting away a model containing that 
frequency as well as components with frequencies at 2, 
3, and 4 times that fundamental frequency (harmonics). 
Then, we subtract that model from the data, update 
and recalculate Pf{f) to find more periodic components. 


Algorithm 1 Periodic features extracted from light 
curves using generalized Lomb-Scargle 

1: procedure FindPeriodicComponents 
2: for z ^ 1, 2, 3 do 

3: Calculate Lomb-Scargle periodogram Pf(f) for light 

curve. 

4: Find peak in Pf(f)^ subtract that model from data. 

5: Update Xo- 


Then, the features extracted are given as an amplitude 
and a phase: 



where Aij is the amplitude of the jth harmonic of the 
ith frequency component and is the phase com¬ 

ponent, which we then correct to a relative phase with 
respect to the phase of the first component: 


PFT' — PFT 


PH, 


00 


( 10 ) 


and remapped to | — tt, -h7r|. 


2.10. Color \Kim et al\2011 ) 

This feature only applies to astronomical light curves 
and is defined as the difference between the average mag¬ 
nitude of two different bands observations. 
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p{t = kh) = 






rriiiti) rrijitj) 


Where h is the slot size, y is the normalized magnitude, 
p(0) is the slotted autocorrelation for the first lag, and 
W is the number of pairs that fall in the given slot. 
Again, we define the length of the slotted autocorrelation 
function as the lag value where the p(t = kh) becomes 
smaller than e~^. 

In order to check the validity of this feature we calcu¬ 
lated the slotted autocorrelation function for a normal 
distribution with h = 1 and compared it with the auto¬ 
correlation function. These two should be equivalent in 
this case where the time intervals are constant. 
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Fig. 3.— Blue and red band observations for a particular light 
curve. 


2.11. Autocorrelation function length (Kim et 'aUji2011 ) 


The autocorrelation or serial correlation, is the linear 
dependence of a signal with itself at two points in time. 
In other words, it represents how similar the observations 
are as a function of the time lag between them. It is 
often used for detect non-randomness in data or to find 
repeating patterns. 

For an observed series mi, m 2 ,..., ruT with sample 
mean m, the sample lag—h autocorrelation is given by: 


. _ - m){mt-h - m) 

kh ( - N2 

Since the autocorrelation function of a light curve is 
given by a vector and we can only return one value as a 
feature, we define the length of the autocorrelation func¬ 
tion as the lag value where becomes smaller than e~^. 


2.12. Slotted autocorrelation function length 

In slotted autocorrelation (Huijse et al. 2012), time lags 
are defined as intervals or slots instead of single values. 
The slotted autocorrelation function at a certain time lag 
slot is computed by averaging the cross product between 
samples whose time differences fall in the given slot. 
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Fig. 4. — Autocorrelation function and Slotted autocorrelation 
function for a normal distributed time series with T = 1. 


2.13. Stetson K, Stets on K_AC, Stetson J and Stetson 
L (Kim et al. JMnh 

These four featur es are based on the Welch/Stetson 
variability index I (Stetson 1996) defined by the equa¬ 
tion: 


I = 






where bi and Vi are the apparent magnitudes obtained 
for the candidate star in two observations closely spaced 
in time on some occasion i, and cr^^i are the standard 

errors of those magnitudes, h and v are the weighted 
mean magnitudes in the two filters, and n is the number 
of observation pairs. 

Since a given frame pair may include data from two 
filters which did not have equal numbers of observations 
overall, the “relative error” is calculated as follows: 


(5 = 


n — 1 ay 


allowing all residuals to be compared on an equal basis. 


2.13.1. Stetson K 

Stetson K is a robust kurtosis measure: 


iAELR 
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where the index i runs over all N observations avail¬ 
able for the star without regard to pairing. For a Gaus¬ 
sian magnitude distribution K should take a val ue close 


to y/ 2lTi = 0.798. The proof can be found in (Stetson 


T^. 


2.16. 

index calculated from the folded light curve. 


2.17. Small Kurtosis (Richards et al^Oll ) 


Small sample kurtosis of the magnitudes: 


2.13.2. Stetson K_AC 

Stetson K applied to the slotted autocorrelation func¬ 
tion of the light curve. 

2.13.3. Stetson J 

Stetson J is a robust version of the variability index. 
It is calculated based on two simultaneous light curves 
of a same star and is defined as: 

n 

J = '^sgn{Pk)^/\^\ 

/C = l 

with Pfe = 

For a Gaussian magnitude distribution, J should take 
a value close to zero. 


2.13.4. Stetson L 


Stetson L variability index describes the synchronous 
variability of different bands and is defined as: 


0.798 

Again, for a Gaussian magnitude distribution, L should 
take a value close to zero. 


2.14. Variability index (Kim et aL\\2014]) 

Variability index 77 (|Von Neumann et al. 1941) is the 
ratio of the mean of the square of successive dinerences 
to the variance of data points. The index was originally 
proposed to check whether the successive data points are 
independent or not. In other words, the index was devel¬ 
oped to check if any trends exist in the data. It is defined 


as: 


T] = 


{N-l)a^ 


N-l 

E 

i=l 


{rrii+i - rriif 


Although 77 is a powerful index for quantifying vari¬ 
ability characteristics of a time series, it does not take 
into account unequal sampling. Thus we use 77 ®, which 
is defined as: 


77® = to {In-i 


-ii) 


2 Yh=i Wi {rui+i - rriif 


E N — 1 
i=l Wi 


1 

Wi = -^ 

ik+i - tif 

The variability index should take a value close to two 
for uncorrelated normally distributed time series with 
mean equals to zero and large N. 


2.15. 


Variability index rjcoior 


(Kim et~oL^2014]/ 


77 ® index calculated from the color light curve (color as 
a function of time). 


N (V + 1) V 

(AT-1) (TV-2) (AT-3)^ 

3 (AT - 1)^ 

(N -2){N- 3) 

For a normal distribution, the small kurtosis should be 
zero. 

2.18. Skewness (ji) (Richards et a/.|[^d77 ) 

The skewness of a sample is defined as follow: 

N h mi — m 

= (N-l) (iV-2) ^ 

For a normal distribution it should be equal to zero. 
2.19. Median absolute devi ation ( MAD) (Richards et al. 

The median absolute deviation is defined as the median 
discrepancy of the data from the median data: 



MAD = median (|m — median(m)|) 


It should take a value close to 0.675 for a normal dis¬ 
tribution. This can be proven by using the interquartile 
ranges of a normal distribution. 


2 . 20 . Amplitude (Richards et al\2011 ) 

The amplitude is defined as the half of the difference 
between the median of the maximum 5% and the me¬ 
dian of the minimum 5% magnitudes. For a sequence of 
numbers from 0 to 1000 the amplitude should be equal 
to 475.5. 


2 . 21 . Percent amplitude (Richards et aL^2011 ) 


Largest percentage difference between either the max 
or min magnitude and the median. 


2 . 22 . Con (Kim et ^\2011 ) 

Index introduced fo r the selection o f variable stars from 
the OGLE database (|Wozniak||2000|) . To calculate Gon, 
we count the number of three consecutive data points 
that are brighter or fainter than 2 cr and normalize the 
number by N — 2. 

For a normal distribution and by considering just one 
star, Gon should take values close to 0.045 (area under a 
Gaussian curve within two sigma). 


2.23. Anders on-Darling test (Kim et 

The Anderson-Darling test is a statistical test of 
whether a given sample of data is drawn from a given 
probability distribution. When applied to testing if a 
normal distribution adequately describes a set of data, it 
is one of the most powerful statistical tools for detecting 
most departures from normality. 

For a normal distribution the Anderson-Darling statis¬ 
tic should take values close to 0.25. 
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2.24. Linear trend \Riehards et al.^2011 ) 

Slope of a linear fit to the light curve. 

2.25. Max slope l[Riehards et al^^MTl ) 

Maximum absolute magnitude slope between two con¬ 
secutive observations. 

2.26. Beyond 1 std (Riehards et al.^Oll ) 

Percentage of points beyond one standard deviation 
from the weighted (by photometric errors) mean. 

For a normal distribution, it should take a value close 
to 0.32 (area under a Gaussian curve within one sigma). 


2.32. CAR features (Piehara et a/.|[^d71| ) 

In order to model the irregul ar sampled times series 
we use CAR(l) ( Brockwell|2QQ2 ), a continuous time auto 
regressive model. CAK(l) process has three parameters, 
it provides a natural and consistent way of estimating 
a characteristic time scale and variance of light curves. 
CAR(l) process is described by the following stochastic 
differential equation: 


dX{t) = —^X{t)dt + ac\^e{t) 


hdt 


for T^cFc A ^ 0 


2.27. Pair slope trend {Riehards et al\20TD^ 

Considering the last 30 (time-sorted) measurements of 
source magnitude, the fraction of increasing first differ¬ 
ences minus the fraction of decreasing first differences. 

2.28. Flux pereentile ratio mid20, mid 35, mid 5 0, mid 
65 and mid 80 HRiehards et aLf2011 ) 

In order to characterize the sorted magnitudes distri¬ 
bution we use percentiles. If ^ 5^95 is the difference be¬ 
tween 95% and 5% magnitude values, we calculate the 
following: 


where the mean value of the light curve X{t) is hr 

2 

and the variance is r is the relaxation time of the 

process X(T), it can be interpreted as describing the 
variability amplitude of the time series, ac can be inter¬ 
preted as describing the variability of the time series on 
time scales shorter than r. e(t) is a white noise process 
with zero mean and variance equal to one. The likelihood 
function of a CAR(l) model for a light curve with obser¬ 
vations X — {xi,..., Xn} observed at times {ti,..., 
with measurements error variances {(5^,..., (5^} is: 


• flux_percentile_ratio_mid 20 : ratio F40,60/^5,95 

• flux_percentile_ratio_mid 35 : ratio F32.5,67.5/^5,95 

• flux_percentile_ratio_mid 50 : ratio ^25^5/^5^95 

• fiux_percentile_ratio_mid 65 : ratio ^17.5^82.5/^5,95 

• fiux_percentile_ratio_mid 80 : ratio Tio,90/^5,95 

For the first feature for example, in the case of 
a normal distribution, this is equivalent to calculate 

er/-b2-Q.6-l)-er/-b2-Q.4-l) o .. exDcrted values 
er/-q2-o.95-i)-er/-q2-o.05-i) • expecxea vaiucs 

for each of the flux percentile features are: 


• flux_percentile_ratio_mid20 = 0.154 

• flux_percentile_ratio_mid35 = 0.275 

• flux_percentile_ratio_mid50 = 0.410 

• flux_percentile_ratio_mid65 = 0.568 

• flux_percentile_ratio_mid80 = 0.779 

2.29. Pereent differenee fl ux pere entile (Riehards et ai 

Jm\ ) 

Ratio of ^ 5,95 over the median magnitude. 


2.30. 


Q3-1 


(Kim et~al\2014 ) 


Qs-i is the difference between the third quartile, Qs, 
and the first quartile, Qi, of a raw light curve. Qi is a 
split between the lowest 25% and the highest 75% of data. 
Qs is a split between the lowest 75% and the highest 25% 
of data. 


2.31. Q 3 -i\b-r (Kim et ol^2014 ) 

Qs-i applied to the difference between both bands of 
a light curve (B-R). 


p{x\b, (Jc,t) 


n 


1 


exp{ 


1 {Xi-X*f 

2 Qi + Jf ^ 


X* = Xi — hr 
xo =0 


^0 = 


2 


Xi — aiXi—\ -j- 


1 


= ^0 (1 ~ ^1 


^i-1 A 

^i-1 + ^i-i ) 


a- = ^-Oi-U-^)lT 

To find the optimal parameters we maximize the like¬ 
lihood with respect to gq and r and calculate b as the 
mean magnitude of the light curve divided by r. 


3. THE LIBRARY 
3.1. Installation 

The library is coded in python and can be down¬ 
loaded from the Github repository https://github. 
com/isadoranun/FATS, It is also possible to obtain 
the library by downloading the Python package from 
http://pypi.python.org/pypi/FATS or by directly in¬ 
stalling it from the terminal as follows: 


pip install FATS 

New features may be added by issuing pull requests 
via the Github version control system. 
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3.2. Input 

The library receives as input the time series data and 
returns as output an array with the calculated features. 
Depending on the available input the user can calculate 
different features. For example, if the user has only the 
vectors magnitude and time, just the features that need 
this data will be able to be computed. This will be deeper 
explained in section |3.4j In order to calculate all the 
possible features the following vectors (also termed as 
raw data) are needed per light curve: 

• magnitude 

• time 

• error 

• magnitude2 

• time2 

• error2 

where 2 refers to a different observation band. It is 
worth pointing out that the magnitude vector is the only 
input strictly required by the library given that it is nec¬ 
essary for the calculation of all the features. The remain¬ 
ing vectors are optional since they are needed just by 
some features. In other words, if the user does not have 
this additional data or he is analyzing time series other 
than light curves, it is still possible to calculate so me of 
the features. More details are presented in section 

When observed in different bands, light curves of a 
same object are not always monitored for the same time 
length and at the same precise times. For some features, 
it is important to align the light curves and to only con¬ 
sider the simultaneous measurements from both bands. 
The aligned vectors refer to the arrays obtained by syn¬ 
chronizing the raw data. 

Thus, the actual input needed by the library is an ar¬ 
ray containing the following vectors and in the following 
order: 

• magnitude 

• time 

• error 

• magnitude2 

• aligned magnitude 

• aligned magnitude2 

• aligned time 

• aligned error 

• aligned error2 

To exemplify how the input should be, we provide a 
basic toolbox for importing and preprocessing the data 
from the MACHO survey. The function ReadLC_MACHO 
receives a MACHO id of a light curve as an input, pro¬ 
cesses the file and returns the following output: 

• mag: magnitude measurement 


• time: time of measurement 

• error: associated observational error 

Besides ReadLC_MACHO, the toolbox also provides the 
following utility functions: 

• Preprocess_LC: points within 5 standard devia¬ 
tions from the mean are considered as noise and 
thus are eliminated from the light curve. 

• Align_LC : it synchronizes the light curves in 
the two different bands and returns aligned_mag, 
aligned_mag2, aligned_time, aligned_error 
and aligned_error2. 

Note: mag, time, and error must have the same 
length, as well as aligned_mag, aligned_mag2, 
aligned_time, aligned_error, and aligned_error2. 

The following code is an example of how to use the 
reading MACHO toolbox: 

#Open the light curve for two different 
bands 

lc_B = FATS.ReadLC.MACHO(' lc_58.6272.729.B. 
mjd ' ) 

lc_R = FATS.ReadLC_MACHO(' lc_58.6272.729.R. 
mjd') 

#Import the data 

[mag, time, error] = lc_B.ReadLC() 

[mag2, time2, error2] = lc_R.ReadLC() 

ttPreprocess the data 

preprocessed_data = FATS.Preprocess_LC(mag, 
time, error) 

[mag, time, error] = preprocessed_data . 
Preprocess () 

preprocessed_data = FATS.Preprocess_LC(mag2 
, time2 , error2) 

[mag2, time2 , error2] = preprocessed_data. 
Preprocess () 

# Synchronize the data 
if len(mag) != Ien(mag2): 

[aligned_mag , aligned_mag2 , 

aligned_time, aligned_error, 
aligned_error2] = FATS.Align_LC( 

time, time2, mag, mag2, error, 
error2) 

Ic = np.array([mag, time, error, mag2 , 

aligned_mag , aligned_mag2 , aligned_time 
, aligned_error, aligned_error2]) 

3.3. Structure 

The library structure is divided into two main parts. 
Part one: Feature.py, is a wrapper class that allows 
the user to select the features to be calculated based on 
the available time series vectors or to specify a list of 
features. Part two: FeaturePunciontLib.py, contains 
the actual code for calculating the features. Each feature 
has its own class with at least two functions: 

init: receives the necessary parameters (when needed) 
for the feature calculation. It also defines the required 
vectors for the computation (e.g. magnitude and time). 



fit: returns the calculated feature. The output can 
only take one value; features like the autocorrelation 
function must consequently be summarized in one sin¬ 
gle scalar. 

The following code is an example of a class in Fea- 
turePunciontLib.py that calculates the slope of a 
linear fit to the light curve: 

class LinearTrend(Base): 
def ( self ) : 

self .Data=[ 'magnitude' , 'time ' ] 
def fit( self ,data) : 

magnitude = data [0] 
time = data [1] 

regression_slope = stats.linregress 

( 

time, magnitude) [0] 

return regression_slope 

If the user wants to contribute with features to the li¬ 
brary, they must be added to FeaturePunctionLib.py 
following the explained format. There is no need to mod¬ 
ify Feature.py. The user should also add a va lidit y test 
to the unit test file (see explanation in section [T^ ). 

3.4. Choosing the features to ealeulate 

The library allows the user to either choose the specific 
features of interest to be calculated, or to calculate them 
all simultaneously. Nevertheless, as already mentioned, 
the features are divided depending on the input data 
needed for their computation (magnitude, time, error, 
second data, etc.). If unspecified, this will be used as an 
automatic selection parameter. For example, if the user 
wants to calculate all the available features but only 
has the vectors magnitude and time^ only the features 
that need magnitude and/or time as an input will be 
computed. 

Note: some features depend on other features and 
consequently must be computed together. For instance. 
Period-fit returns the false alarm probability of the 
estimated period. It is necessary then to calculate also 
the period PeriodLS. 

The list of all the possible features with their corre¬ 
sponding input data, additional parameters and litera¬ 
ture source is presented in Table 

The possible ways of how an user can choose the fea¬ 
tures from the library to be calculated are presented next. 

3.4.1. List of features as input 

The user can also specify a list of features as input 
by specifying the features as a list for the parameter 
featureList. In the following example, we aim to cal¬ 
culate the standard deviation and Stetson L of the data: 

a = FATS.FeatureSpace(featureList=[ 'Std' , ' 

StetsonL' ]) 

a = a.calculateFeature (Ic) 


3.4.2. Available data as input 

In case the user does not have all the input vectors 
mentioned above, it is necessary to specify the available 
data by specifying the list of vectors using the parameter 
Data. In the example below, we calculate all the features 
that can be computed with the magnitude and time as 
an input. 

a = FATS.FeatureSpace(Data=[' magnitude' , ' 

time ' ]) 

a = a . calculateFeature (Ic) 

3.4.3. List of features and available data as input 

It is also possible to provide the available time series 
input vectors and calculate all possible features from a 
feature list using this data: 

a = FATS.FeatureSpace(featureLiSt=[' Mean' , 

'BeyondlStd' , 'CAR_sigma' , 'Color' , ' 

SlottedA_length' ], Data = ['magnitude', 

'error ' ] ) 

a = a . calculateFeature(Ic)} 

In this case only the features Mean and BeyondlStd 
could be computed given the data, and warnings would 
be thrown for the other features: 

Warning: the feature CARsigma eould not be ealeu- 
lated beeause [‘magnitude^ dime’, drrorf are needed. 

Warning: the feature Color eould not be ealeulated 
beeause [^magnitude’, dime’, ‘magnitude2’] are needed. 

Warning: the feature SlottedAJength eould not be eal¬ 
eulated beeause [‘magnitude’, dime’] are needed. 

3.4.4. Excluding list as input 

The user can also create a list of features to be 
excluded from the calculation. To do so, the list of 
features to be excluded can be passed as a list via the 
parameter excludeList. For example: 

a = FATS.FeatureSpace(Data=[' magnitude' , ' 

time', 'error'], excludeList=[ ' 
SlottedA_lengtli ' , ' StetsonK_AC ' , ' 

PeriodLS ' ]) 

a = a.calculateFeature(Ic) 

The following list of features would be calcu¬ 
lated: Amplitude, Anders onDarling, Autoeor-length, 

BeyondlStd, CAR-mean, CARsigma, CAR-tau, Con, 
Eta-e, FluxPereentileRatioMid features. Harmonies 
features, LinearTrend, MaxSlope, Mean, Meanvari- 
anee, MedianAbsDev, MedianBRP, PairSlopeTrend, 
PereentAmplitude, PereentDiffereneeFluxPereentile, Pe¬ 
riod-fit, Psi-CS, Psi-eta, Q31, Res, Skew, SmallKurtosis, 
Std, and StetsonK 

and the following warnings would be displayed: 
Warning: the feature Color eould not be ealeulated 
beeause [‘magnitude’, ‘time’, ‘magnitude2’] are needed. 

Warning: the feature Eta-eolor eould not be ealeulated 
beeause [‘magnitude’, ‘time’, ‘magnitude2’] are needed. 
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Feature 

Depends on 

Input data (besides magnitude) 

Parameters 

Default 

Ref 

Amplitude 





1 Richards et al. 

ILoiil II 

AndersonDarling test 





n 




AutocorJength 



Number of lags 

100 

1 

IKim et al. ^ 


ti 

BeyondIStd 


error 



“|r 

ichards et al. 

.miu II 

CAR mean 

CAR-sigma 

time, error 




Pichara et al. 



CAR sigma 


time, error 




Pichara et al. 



CAR.tau 

CAR-sigma 

time, error 




E 

ichara et al. 



Color 


mag2 






Con 



Consecutive stars 

3 




Eta color 


aligned mag, aligned mag2, aligned time 




IWftWdilillljHilEl 


Eta e 


time 





L 


FluxPercentileRatioMid20 






Richards et al. 



FluxPercentileRatioMid35 






Richards et al. 



FluxPercentileRatioMidSO 






Richards et al. 



FluxPercentileRatioMid65 






Richards et al. 



FluxPercentileRatioMid20 






Richards et al. 

||2m 


FluxPercentileRatioMidSO 






Richards et al. 

iImtt 


Freql harmonics amplitude 0 


time 




Richards et al. 

iImtt 


Freq{i}harmonics amplitude{j} 


time 




Richards et al. 



Freq{i}harmonics rel phase{j} 


time 




Richards et al. 



Linear Trend 


time 




Richards et al. 



MaxSlope 


time 




Richards et al. 



Mean 





n 

Kim et al. ^ 



Meanvariance 





rd 

Kim et al. ^ 


bi 

MedianAbsDev 






Richards et al. 



MedianBRP 






Richards et al. 



PairSlopeTrend 






Richards et al. 



Percent Amplitude 






Richards et al. 



PercentDifferenceFluxPercentile 






Richards et al. 


3 


PeriodLS 


time 

Oversampling factor 

6 



Kim et al. 


c 


Period-fit 


time 




Kim et al. 

2011 


Psi CS 


time 




Kim et al. 



Psi eta 


time 




Kim et al. 



Q31 






Kim et al. 



Q31-color 


aligned-mag, aligned-mag2 




Kim et al. 

ml 


Res 






Kim et al. ^ 

mT 



Skew 





d 

[Richards et al. 


C 

SlottedA-length 


time 

Slot size T (days) 

4 

1 

Totopapas et a 



SmallKurtosis 






Richards et al. 


r 

Std 






Richards et al. 



StetsonJ 


aligned mag, aligned mag2, aligned-error, aligned-error2 




Richards et al. 



StetsonK 


error 




Richards et al. 

II20111 


StetsonK AC 






n 

Kim et al. ^ 

2011 

r 


StetsonL 


aligned-mag, aligned-mag2, aligned-error, aligned-error2 




Kim et al. ^ 

mT 


Variability Index 






Kim et al. ^ 

mT 



TABLE 1: List of features 
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Warning: the feature Q31-eolor eould not be caleulated 
beeause [^magnitude’, dime’, ^magnitude2’] are needed. 

Warning: the feature StetsonJ eould not be ealeulated 
beeause [‘magnitude’, ‘time’, ‘error’, ‘magnitude2’, 
‘error2’] are needed. 

Warning: the feature StetsonL eould not be ealeulated 
beeause [‘magnitude’, ‘time’, ‘error’, ‘magnitude2’, 
‘error2’] are needed. 

error: please run PeriodLS first to generate values for 
Period-fit 

error: please run PeriodLS first to generate values for 
PsLCS 

error: please run PeriodLS first to generate values for 
PsLeta 


• we calculate the percentage difference between the 
original features’ values and the ones obtained from 
the random sampling. 

4.2. Unit tests 

In order to systematically check the correctness of the 
feature generation implementation in our library, a unit 
test is created for each feature in a unit test file named 
test_library .py. This script should be always run be¬ 
fore using the library by executing $ py.test at the 
command line. In addition, if a user contributes a new 
feature for the library, a pertinent test should be added 
to the unit test file. The idea is to guarantee, as far as 
possible, that every feature calculated is correct. In most 
cases, this can be reached by calculating the expected fea¬ 
ture value for a known distribution (normal, uniform, or 
otherwise), and then checking it with the value obtained 
from the library. 

The following image shows how py.test output should 
look if all the tests are passed: 


3.4.5. All the features 

To calculate all the existing features in the library, the 
user can set the Data parameter to the string ' all' and 
set the featureList parameter to be None. Obviously, 
in order to do this, the user must have also provided all 
the necessary time series input vectors. 

3.5. Library output 

When calculating the features of a light curve, the out¬ 
put can be returned in three different formats: 

• diet: returns a vector with the name of each feature 
followed by its value. 

• array: returns a vector with only the features val¬ 
ues. 

• features : returns a vector with the list of the cal¬ 
culated features. 

The output format can be specified via a string for the 
parameter method, as shown in the example below: 

a = FATS.FeatureSpace(Data= 'all' , 
featureList =None) 
a = a.calculateFeature(Ic) 
print a.result(method= 'diet' ) 


alatform darwin — Python 2.7.8 — py-1.4.25 — pytest-2.6.3 
alugins: cov 
collected 17 items 

test_library.py . 


Fig. 5.— Unit test output. 


4.3. Classifieation test 

As a way of checking the usefulness of our features we 
trained a classifier and obtained the classification accu¬ 
racy for labeled objects that were not used in the train¬ 
ing set. In particular we used a Random Forest classi¬ 
fier and calculated the F-score of the out-of-bag (oob) 
samples as done in |Nun et al.| (|20I4|) . Since trees are 
constructed from dinerent bootstrapped samples of the 
original data, about one-third of the cases are left out of 
the “bag” and not used in the construction of each tree. 
By putting these oob observations down the trees that 
were not trained with oob data, we end up with unbiased 
predicted labels. The training set we used for the test 
is composed of a subset of 6063 labeled obser vations and 
8 diff erent classes from the MACHO catalog (Kim et al. 
2011|P] The constitution of the training set is presented 
WTa5le|2l 


4. TESTS 

4.1. Robustness tests 

The following section presents the tests performed to 
the features in order to check their invariance to unequal 
sampling Q To do so, we took random observations of 
a light curve and compared the resulting features with 
the ones obtained from the original data. The steps we 
followed are described next: 

• we calculate the features for fifty random samples 
of one original light curve. 

• we obtain the mean and standard deviation of each 
calculated feature. 

^ The code is also available in the github repository 



Class 

Number of objects 

1 

Be Stars 

127 

2 

Cepheid 

101 

3 

Eclipsing Binaries 

255 

4 

Long Period Variable 

365 

5 

MicroLensing 

380 

6 

Non variables 

3966 

7 

Quasars 

59 

8 

RR Lyrae 

610 


TABLE 2: Training Set Composition 


By using 500 trees we obtain a F-score of 0.97 and the 
following confusing matrix: 

^ We collected these variables from the MACHO vari¬ 
able catalog found at: http://vizier.u-strasbg.fr/viz-bin/VizieR?- 
source=II/247 
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Normalized confusion matrix 
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0.5 
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Fig. 6. — Confusion matrix for the MACHO training set 
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We also obtained the importance ranking of the fea¬ 
tures in the classification task as shown in figure To 
understand how the im portance i s calc ulated the reader 
can find information in 


Breiman (2001). 


5. HOW TO CONTRIBUTE 


A user can contribute to this work in two different 
ways: 

• by adding a new feature to the library with its per¬ 
tinent description, validity test and unit test. 


• by further developing or correcting an already ex¬ 
isting feature. 

In the first case, the user can directly download the 
github package of the library and push a new version 
with the added feature. After being checked, it will be 
officially uploaded and the user will be added as a co¬ 
author to this arXiv article. In the second case, the user 
can contact us by email with any suggestions. 

6. UPCOMING FEATURES 

The latest release of the package is FATS 1.3.4, up¬ 
loaded on April 8, 2015. The next features to be added 
are: 


• structure function descriptor 




slepian wavelet variance 


( Graharn]|2Q15 ) 


7. SUMMARY 


In this article we presented FATS, a python library for 
times series feature extraction. In particular we focused 
on astronomical times series (light-curves), nevertheless 
the tool is not restricted to this type of data. We in¬ 
cluded the main descriptors for times series in a sim¬ 
ple collaborative package. Several tests were performed 
to check their robustness and invariance to nonuniform 
sampling. We also proved the suitability of the features 
for classification by obtaining a high F-score of 0.97 with 
a Random-forest classifier and data from the MACHO 
catalog. FATS is under continuous development and fea¬ 
tures will be constantly added. The author list will be as 
well updated insofar as users send us their contribution. 
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Color 
Psi_CS 
Psi_eta 
SlottedA_length 
StetsonJ 
Con 
Stetson L 
PeriodLS 
Res 

Stetson K_AC 
CAR_tau 
Stetson K 
FluxPercentileRatioMidSO 
FluxPercentileRatioMid65 
CAR_tmean 
Skew 
Mean 
Period_fit 
Eta_e 
Autocor_length 
FluxPercentileRatioMid35 
FluxPercentileRatioMid20 
FluxPercentileRatioMidSO 
BeyondlStd 
MedianBRP 
SmallKurtosis 
Eta_color 
MaxSIope 
QSlcolor 
PercentAmplitude 
Freql_harmonics_amplitude_0 
CAR_sigma 
Amplitude 
Meanvariance 
MedianAbsDev 
PercentDifferenceFluxPercentile 
031 

Freql_harmonics_amplitude_l 

Std 

Freq3_harmonics_amplitude_0 

Freql_harmonics_amplitude_2 

Freql_harmonics_amplitude_3 

Freq2_harmonics_amplitude_0 

LinearTrend 

Freq2_harmonics_amplitude_2 

Freq2_harmonics_amplitude_3 

Freq3_harmonics_amplitude_3 

Freq2_harmonics_amplitude_l 

Freq3_harmonics_amplitude_2 

Freq3_harmonics_amplitude_l 

Freq2_harmonics_rel_phase_2 

Freq3_harmonics_rel_phase_2 

Freq3_harmonics_rel_phase_3 

Freq3_harmonics_rel_phase_l 

Freq2_harmonics_rel_phase_l 

Freq2_harmonics_rel_phase_3 

Freql_harmonics_rel_phase_3 

Freql_harmonics_rel_phase_l 

Freql_harmonics_rel_phase_2 

PairSlopeTrend 

AndersonDarling 

Freq3_harmonics_rel_phase_0 

Freql_harmonics_rel_phase_0 

Freq2_harmonics_rel_phase_0 
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Fig. 7.— Feature importance 
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